1 Graphics in R

Now we’ll see one of R’s premier packages in action when graphing data. Let us load the hsb2.RData we saved earlier.

load("~/Downloads/Archive/hsb2.RData")

ggplot2 is one of the leading R packages for graphics, followed closely by lattice. Let us work with ggplot2 first and build some simple graphs. Note that there is extensive help available for ggplot2 on the web. You can start with the Cookbook for R or the ggplot2 documentation. You can also search on Stack Overflow.

1.1 The Mechanics of ggplot2

ggplot2 uses the grammar of graphics to build graphs by breaking up each graph into three components: data, aesthetics, and geometry. You specify the data frame with the data argument, then map variables to the x and y coordinates with the aes() function, and finally add the geometry (bar-chart, histogram, etc.) via a geom_*() function. The geometries for some of the graphs we will use most often are listed below:

  • geom_bar() – bar-chart
  • geom_histogram() – histogram
  • geom_line() – line chart
  • geom_point() – scatter plot
  • geom_density() – density plots
  • geom_jitter() – stripcharts
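
The three components can be seen in a minimal sketch. It uses R's built-in mtcars data rather than hsb2 (an assumption made only so the code runs on its own), and the object name p is arbitrary.

```r
library(ggplot2)

# Data:       the built-in mtcars data frame (a stand-in for hsb2)
# Aesthetics: car weight on the x-axis, miles per gallon on the y-axis
# Geometry:   points, which yields a scatter plot
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Typing the object name draws the plot
p
```

Swapping geom_point() for another geometry changes the type of graph while the data and aesthetics stay the same.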

2 Constructing Graphs

Recall that we can rely on box-plots and histograms to explore the distribution of a numeric (scale) variable. Perhaps we are interested in reading scores and want to start with a histogram.

2.1 Histograms

library(ggplot2)
ggplot(data=hsb2, aes(x=read)) + geom_histogram()

You will see R print the message "stat_bin() using bins = 30. Pick better value with binwidth." That is, by default R groups read into 30 bins. Maybe we want fewer bins, say 10. This can be done as follows:

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10)
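
The message mentions binwidth, which is the other way to control the grouping: bins sets how many groups there are, while binwidth sets how wide each group is. The sketch below contrasts the two. It uses R's built-in faithful data as a stand-in for hsb2 (an assumption, so that the code runs on its own), and layer_data() is called only to inspect the bars that ggplot2 computed.

```r
library(ggplot2)

# faithful$waiting (built into R) stands in for the reading scores

# bins: ask for a given number of groups
by_count <- ggplot(data = faithful, aes(x = waiting)) +
  geom_histogram(bins = 10)

# binwidth: ask for groups of a given width (5 minutes here)
by_width <- ggplot(data = faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5)

# Inspect the computed bars: one row per bar, with its limits and count
bars <- layer_data(by_width)
bars[, c("xmin", "xmax", "count")]
```

Which of the two to use is a matter of convenience: binwidth is easier to interpret when the variable has natural units, bins when you only care about how coarse the picture is.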

We can customize this histogram further, changing the colors, the labels for the x-axis, the y-axis, adding a title, and so on.

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10, fill="cornflowerblue") + ggtitle("Histogram of Reading Scores") + xlab("Reading Score") + ylab("Frequency")

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10, fill="salmon") + ggtitle("Histogram of Reading Scores") + xlab("Reading Score") + ylab("Frequency")

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10, fill="deeppink1") + ggtitle("Histogram of Reading Scores") + xlab("Reading Score") + ylab("Frequency")

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10, fill="yellowgreen") + ggtitle("Histogram of Reading Scores") + xlab("Reading Score") + ylab("Frequency")

Note: A small snippet of the wide expanse of colors available in R can be seen here and you can always brew your own color palette (ask me and I’ll give you the code).
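
One way to explore the built-in colors and to build a palette of your own is sketched below. It relies only on colors() and colorRampPalette() from base R. The object names and the two end-point colors are arbitrary choices for illustration, and this is not necessarily the code referred to in the note above.

```r
# All color names that R recognizes
all_colors <- colors()
head(all_colors)

# Only the names that contain "blue"
blues <- grep("blue", all_colors, value = TRUE)
blues

# A custom palette: 5 colors interpolated between two named colors
my_palette <- colorRampPalette(c("lightyellow", "tomato"))(5)
my_palette
```

Any name returned by colors() can be passed to fill, and a vector such as my_palette can be supplied to scale_fill_manual(values = my_palette) when fill is mapped to a variable with five levels.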

What if we wanted to construct these histograms for male versus female students, or for each of the SES groups?

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10, fill="tomato") + ggtitle("Histogram of Reading Scores") + xlab("Reading Score") + ylab("Frequency") + facet_wrap(~ female)

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10, fill="tomato") + ggtitle("Histogram of Reading Scores") + xlab("Reading Score") + ylab("Frequency") + facet_wrap(~ ses)

What if we wanted to break it out by female/male students in public versus private schools?

ggplot(data=hsb2, aes(x=read)) + geom_histogram(bins = 10, fill="tomato") + ggtitle("Histogram of Reading Scores") + xlab("Reading Score") + ylab("Frequency") + facet_wrap(female ~ schtyp)
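
Because only the facet_wrap() call changes between these plots, the shared part can be stored once and reused. The sketch below shows the idea with the built-in mtcars data as a stand-in for hsb2 (an assumption, so that the code runs on its own); the object names are arbitrary.

```r
library(ggplot2)

# The part of the plot that stays the same
base_plot <- ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "tomato") +
  labs(title = "Histogram of Miles per Gallon",
       x = "Miles per Gallon", y = "Frequency")

# One panel per number of cylinders
by_cyl <- base_plot + facet_wrap(~ cyl)

# One panel per combination of transmission type and cylinders
by_am_cyl <- base_plot + facet_wrap(am ~ cyl)

by_cyl
by_am_cyl
```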

2.2 Box-plots

Now we can revisit the preceding distributions, albeit with box-plots.

ggplot(data=hsb2, aes(x=female, y=read)) + geom_boxplot(fill="seagreen2") + ggtitle("Box-Plot of Reading Scores") + xlab("Gender") + ylab("Reading Score") + coord_flip() 

ggplot(data=hsb2, aes(x=female, y=read)) + geom_boxplot(fill="peachpuff") + ggtitle("Box-Plot of Reading Scores (by Gender & School Type)") + xlab("Gender") + ylab("Reading Score") + coord_flip() + facet_wrap(~ schtyp)

2.3 Bar-Charts

Recall the bar-charts we used for qualitative variables last semester. Let us generate a few for gender, schtyp, prog, ses, and race.

ggplot(data=hsb2, aes(female)) + geom_bar(fill="seagreen2") + ggtitle("Bar-Chart of Gender") + xlab("Gender") + ylab("Frequency") + theme(axis.text.x=element_text(angle = 90, hjust = 0))

ggplot(data=hsb2, aes(race)) + geom_bar(fill="seagreen2") + ggtitle("Bar-Chart of Race (by School Type)") + xlab("Race") + ylab("Frequency") + facet_wrap(~ schtyp) + theme(axis.text.x=element_text(angle = 90, hjust = 0))

ggplot(data=hsb2, aes(race)) + geom_bar(fill="seagreen2") + ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") + ylab("Frequency") + facet_wrap(ses ~ schtyp) + theme(axis.text.x=element_text(angle = 90, hjust = 0))

ggplot(data=hsb2, aes(race)) + geom_bar(fill="seagreen2") + ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") + ylab("Frequency") + facet_wrap(ses ~ schtyp, ncol=2) + theme(axis.text.x=element_text(angle = 90, hjust = 0))

2.4 Line Charts

If you have data measured over time, then a line chart is a good way to show trends.

library(plotly)
plot_ly(economics, x = ~date, y = ~uempmed, name = "unemployment", type = "scatter", mode = "lines")

plotly is a package for interactive graphics, so don’t assume this is how a typical line chart looks. For example, the same plot rendered via ggplot2 would look as follows:

ggplot(data=economics, aes(x=date, y=uempmed)) + geom_line()

Regardless of the package-specific rendering, the basic point should be obvious: You can see how unemployment varies over time.

2.5 Scatter-plots

If we have two numeric (scale) variables then a scatter-plot is a great way to explore if and how these two variables are related.

ggplot(data=iris, aes(x=Sepal.Length, y = Petal.Width, color=Species)) + geom_point()

ggplot(data=mtcars, aes(x=qsec, y = mpg, color=factor(cyl))) + geom_point()

2.6 Density Plots

ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) + geom_density(alpha=0.3)

2.7 Stripcharts

ggplot(data=iris, aes(y=Sepal.Length, x=Species, color=Species)) + geom_jitter()

ggplot(data=iris, aes(y=Sepal.Length, x=Species, color=Species)) + geom_jitter() + stat_summary(fun = median, geom="point", size=3, color="black")

ggplot(data=iris, aes(y=Sepal.Length, x=Species, color=Species)) + geom_jitter() + stat_summary(fun.data  = "mean_sdl", geom="pointrange", size=0.5, color="black")
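
Note that "mean_sdl" relies on the Hmisc package being installed and, by default, draws the mean plus or minus two standard deviations. If you would rather not depend on Hmisc, or want one standard deviation instead, a small helper can be passed to fun.data. The helper below, mean_sd, is a hypothetical name introduced here for illustration; it is a sketch, not part of ggplot2.

```r
library(ggplot2)

# Hypothetical helper: returns the mean and mean +/- 1 standard deviation
# in the y / ymin / ymax format that stat_summary(fun.data = ...) expects
mean_sd <- function(x) {
  data.frame(y = mean(x),
             ymin = mean(x) - sd(x),
             ymax = mean(x) + sd(x))
}

strip <- ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) +
  geom_jitter(width = 0.2) +
  stat_summary(fun.data = mean_sd, geom = "pointrange", size = 0.5, color = "black")

strip
```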

ggplot(data=iris, aes(y=Sepal.Length, x=Species, color=Species)) + geom_boxplot() + geom_jitter() 

3 A Teaser on Mapping with ggplot2 and ggmap

I’ll leave you with a few maps: first of the 48 contiguous states, then of all counties in those states, then of the counties in Ohio, and finally street maps of Athens, Ohio.

library(maps)
library(ggmap)

states = map_data("state")

ggplot() + geom_polygon(data = states, aes(x=long, y = lat, group = group, fill=region)) + coord_fixed(1.3) + guides(fill="none")

counties <- map_data("county")

ggplot() + geom_polygon(data = counties, aes(x=long, y = lat, group = group, fill=region)) + coord_fixed(1.3) + guides(fill="none")

ohio = subset(counties, region == "ohio")

ggplot() + geom_polygon(data = ohio, aes(x=long, y = lat, group = group, fill=subregion)) + coord_fixed(1.3) + guides(fill="none")

athens = get_map(location = "Athens, Ohio", zoom=14, source="osm")
ggmap(athens)

whitehouse = get_map(location = "The White House, Washington DC", zoom=16, source="osm")
ggmap(whitehouse)

athens = get_map(location = "Athens, Ohio", zoom=14)
ggmap(athens)